Created: 2026-03-06 07:53:04
Updated: 2026-03-06 07:53:04

12.1 The Method of Types

The AEP lets us focus on a small set of typical sequences; the method of types lets us consider the set of sequences sharing the same empirical distribution. Under this restriction, we can derive bounds on the number of sequences having a particular empirical distribution and on the probability of each sequence in this set.

The type $P_{\mathbf{x}}$ of a sequence $x_{1}, \dots, x_{n}$ is the relative proportion of occurrences of each symbol of $\mathscr{H}$ in $\mathbf{x}$: $P_{\mathbf{x}}(a) = \frac{N(a \mid \mathbf{x})}{n}$ for all $a \in \mathscr{H}$, where $N(a \mid \mathbf{x})$ is the number of times the symbol $a$ occurs in the sequence $\mathbf{x} \in \mathscr{H}^n$.

Let $\mathscr{P}_{n}$ denote the set of all types with denominator $n$.
For example, if $\mathscr{H} = \{0, 1\}$, there are $n+1$ types in total:

$$\mathscr{P}_{n} = \left\{ (P(0), P(1)) : \left( \frac{0}{n}, \frac{n}{n} \right), \left( \frac{1}{n}, \frac{n-1}{n} \right), \dots, \left( \frac{n}{n}, \frac{0}{n} \right) \right\}$$
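
As a small illustration (my own sketch, not from the text), a type can be computed directly from a sequence; `empirical_type` is a hypothetical helper name:

```python
from collections import Counter
from fractions import Fraction

def empirical_type(x, alphabet):
    """The type P_x: relative frequency of each alphabet symbol in x."""
    counts = Counter(x)
    return {a: Fraction(counts[a], len(x)) for a in alphabet}

# Over H = {0, 1}, the type of 01101 is (2/5, 3/5).
P = empirical_type([0, 1, 1, 0, 1], [0, 1])
assert P == {0: Fraction(2, 5), 1: Fraction(3, 5)}
```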

Definition: if $P \in \mathscr{P}_{n}$, the set of all sequences of length $n$ and type $P$ is called the type class of $P$, denoted $T(P)$:

$$T(P) = \left\{ \mathbf{x} \in \mathscr{H}^n : P_{\mathbf{x}} = P \right\}$$

For example, if $\mathbf{x} = 11321$ and $\mathscr{H} = \{1, 2, 3\}$, then $P_{\mathbf{x}}(1) = \frac{3}{5}$, $P_{\mathbf{x}}(2) = \frac{1}{5}$, $P_{\mathbf{x}}(3) = \frac{1}{5}$.
The number of sequences in $T(P)$ is

$$|T(P)| = \frac{5!}{3!\,1!\,1!} = 20$$
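
The multinomial count above can be cross-checked by brute force (my own sketch; `type_class_size` is a hypothetical name):

```python
from itertools import product
from math import factorial, prod

def type_class_size(counts):
    """|T(P)| = n! / prod((n P(a))!) for the count vector of a type."""
    n = sum(counts)
    return factorial(n) // prod(factorial(c) for c in counts)

# x = 11321 over H = {1, 2, 3}: counts (3, 1, 1) give 5!/(3! 1! 1!) = 20.
assert type_class_size((3, 1, 1)) == 20

# Cross-check by enumerating all of H^5.
brute = sum(1 for s in product((1, 2, 3), repeat=5)
            if (s.count(1), s.count(2), s.count(3)) == (3, 1, 1))
assert brute == 20
```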

Theorem 12.1.1:

$$|\mathscr{P}_{n}| \leq (n+1)^{|\mathscr{H}|}$$

Proof: $|\mathscr{P}_{n}|$ is the number of ordered tuples of $|\mathscr{H}|$ natural numbers (the symbol counts) summing to $n$, and each component can take at most $n+1$ values.
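
A brute-force check of this bound for a small case (my own sketch, not from the text):

```python
from itertools import product

def all_types(n, k):
    """All types with denominator n over an alphabet of size k:
    ordered k-tuples of counts summing to n."""
    return [c for c in product(range(n + 1), repeat=k) if sum(c) == n]

n, k = 5, 3
types = all_types(n, k)
# The exact count is C(n+k-1, k-1) = C(7, 2) = 21, well below (n+1)^k = 216.
print(len(types), (n + 1) ** k)  # 21 216
```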

Theorem 12.1.2:
If $X_{1}, \dots, X_{n}$ are drawn i.i.d. according to $Q(x)$, then the probability of $\mathbf{x}$ depends only on its type $P_{\mathbf{x}}$ and is given by

$$Q^n(\mathbf{x}) = 2^{-n(H(P_{\mathbf{x}}) + D(P_{\mathbf{x}} \mid\mid Q))}$$

Remark: $\mathbf{x}$ is the sequence we examine, with type $P$, while $Q$ is the true distribution. By the AEP, $2^{-nH(Q)}$ is the probability of each sequence in the typical set; when $P = Q$, the probability attains this "maximum" value.

Corollary: if $\mathbf{x}$ is in the type class of $Q$, then $Q^{n}(\mathbf{x}) = 2^{-nH(Q)}$. This follows because $\mathbf{x} \in T(Q) \implies P_{\mathbf{x}} = Q$.
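
Theorem 12.1.2 is easy to verify numerically; the sketch below (my own, with a made-up $Q$ and sequence) checks the identity for one example:

```python
from math import isclose, log2, prod

def seq_prob(x, Q):
    """Q^n(x): the i.i.d. probability of the sequence x under Q."""
    return prod(Q[a] for a in x)

def type_formula(x, Q):
    """2^{-n(H(P_x) + D(P_x || Q))}, computed from the type of x."""
    n = len(x)
    P = {a: x.count(a) / n for a in set(x)}
    H = -sum(p * log2(p) for p in P.values())
    D = sum(p * log2(p / Q[a]) for a, p in P.items())
    return 2 ** (-n * (H + D))

Q = {0: 0.6, 1: 0.4}        # made-up distribution
x = [0, 0, 1, 0, 1]
assert isclose(seq_prob(x, Q), type_formula(x, Q))
```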
Theorem 12.1.3:
Size of a type class $T(P)$: for any type $P \in \mathscr{P}_{n}$,

$$\frac{1}{(n+1)^{|\mathscr{H}|}} \, 2^{nH(P)} \leq |T(P)| \leq 2^{nH(P)}$$

Consequently, the probability of each type class satisfies

$$\frac{2^{-nD(P_{\mathbf{x}} \mid\mid Q)}}{(n+1)^{|\mathscr{H}|}} \leq |T(P_{\mathbf{x}})| \, Q^n(\mathbf{x}) \leq 2^{-nD(P_{\mathbf{x}} \mid\mid Q)}$$

This bound is useful for estimating the probability of large deviations from the typical statistics, and it gives sharper estimates than the law of large numbers.

The exact expression is:

$$|T(P)| = \frac{n!}{(nP(a_{1}))! \, (nP(a_{2}))! \cdots (nP(a_{|\mathscr{H}|}))!}$$
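
Both bounds and the exact expression can be checked for a small case (my own sketch; the type $P = (1/2, 1/3, 1/6)$ with $n = 12$ is an arbitrary choice):

```python
from math import factorial, log2, prod

def type_class_size(counts):
    """Exact |T(P)| = n! / prod((n P(a))!)."""
    n = sum(counts)
    return factorial(n) // prod(factorial(c) for c in counts)

n, counts = 12, (6, 4, 2)    # the type P = (1/2, 1/3, 1/6)
k = len(counts)
H = -sum(c / n * log2(c / n) for c in counts)
size = type_class_size(counts)

# 1/(n+1)^|H| * 2^{nH(P)} <= |T(P)| <= 2^{nH(P)}
assert 2 ** (n * H) / (n + 1) ** k <= size <= 2 ** (n * H)
print(size)  # 13860
```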

Proof:

$$\begin{aligned} 1 & \geq P^n(T(P)) \\ & = \sum_{\mathbf{x} \in T(P)} P^n(\mathbf{x}) \\ & = \sum_{\mathbf{x} \in T(P)} 2^{-nH(P)} \\ & = |T(P)| \, 2^{-nH(P)} \end{aligned}$$

so $|T(P)| \leq 2^{nH(P)}$.

Before proving the lower bound, we first show that under the distribution $P$, the type class $T(P)$ has the highest probability among all type classes:

$$P^n(T(P)) \geq P^n(T(\hat{P})), \quad \forall \hat{P} \in \mathscr{P}_{n}$$

$$\begin{aligned} \frac{P^n(T(P))}{P^n(T(\hat{P}))} & = \frac{|T(P)| \prod_{a \in \mathscr{H}} P(a)^{nP(a)}}{|T(\hat{P})| \prod_{a \in \mathscr{H}} P(a)^{n\hat{P}(a)}} \\ & = \frac{\binom{n}{nP(a_1), \, nP(a_2), \, \ldots, \, nP(a_{|\mathscr{H}|})} \prod_{a \in \mathscr{H}} P(a)^{nP(a)}}{\binom{n}{n\hat{P}(a_1), \, n\hat{P}(a_2), \, \ldots, \, n\hat{P}(a_{|\mathscr{H}|})} \prod_{a \in \mathscr{H}} P(a)^{n\hat{P}(a)}} \\ & = \prod_{a \in \mathscr{H}} \frac{(n\hat{P}(a))!}{(nP(a))!} \, P(a)^{n(P(a) - \hat{P}(a))} \end{aligned}$$

Using the inequality $\frac{m!}{n!} \geq n^{m-n}$,

$$\begin{aligned} \frac{P^n(T(P))}{P^n(T(\hat{P}))} & \geq \prod_{a \in \mathscr{H}} (nP(a))^{n\hat{P}(a) - nP(a)} \, P(a)^{n(P(a) - \hat{P}(a))} \\ & = \prod_{a \in \mathscr{H}} n^{n(\hat{P}(a) - P(a))} \\ & = n^{n \left( \sum_{a \in \mathscr{H}} \hat{P}(a) - \sum_{a \in \mathscr{H}} P(a) \right)} \\ & = n^{n(1-1)} \\ & = 1 \end{aligned}$$

Therefore,

$$\begin{aligned} 1 & = \sum_{Q \in \mathscr{P}_{n}} P^n(T(Q)) \\ & \leq \sum_{Q \in \mathscr{P}_{n}} \max_{Q'} P^n(T(Q')) \\ & = \sum_{Q \in \mathscr{P}_{n}} P^n(T(P)) \\ & \leq (n+1)^{|\mathscr{H}|} \, P^n(T(P)) \\ & = (n+1)^{|\mathscr{H}|} \sum_{\mathbf{x} \in T(P)} P^n(\mathbf{x}) \\ & = (n+1)^{|\mathscr{H}|} \sum_{\mathbf{x} \in T(P)} 2^{-nH(P)} \\ & = (n+1)^{|\mathscr{H}|} \, |T(P)| \, 2^{-nH(P)} \end{aligned}$$

which gives the lower bound $|T(P)| \geq \frac{2^{nH(P)}}{(n+1)^{|\mathscr{H}|}}$. $\square$

12.4 Large Deviation Theory

Let $E$ be a subset of the set of probability mass functions. For example, $E$ may be the set of probability mass functions with mean $\mu$. With a slight abuse of notation, we write

$$Q^n(E) = Q^n(E \cap \mathscr{P}_{n}) = \sum_{\mathbf{x} : P_{\mathbf{x}} \in E \cap \mathscr{P}_{n}} Q^n(\mathbf{x})$$

If $E$ contains a relative entropy neighborhood of $Q$, then by the weak law of large numbers, $Q^n(E) \to 1$. On the other hand, if $E$ does not contain $Q$ or a neighborhood of $Q$, then $Q^n(E) \to 0$ exponentially fast. We will use the method of types to calculate the exponent.
Example: suppose we observe that the sample mean of $g(X)$ is at least $\alpha$, i.e.,

$$\frac{1}{n} \sum_{i} g(X_{i}) \geq \alpha$$

This is equivalent to $P_{\mathbf{x}} \in E \cap \mathscr{P}_{n}$, where

$$E = \left\{ P : \sum_{a \in \mathscr{H}} g(a) P(a) \geq \alpha \right\}$$

Sanov's theorem: Let $X_{1}, X_{2}, \dots, X_{n}$ be i.i.d. $\sim Q(x)$ and let $E \subseteq \mathscr{P}$ be a set of probability distributions. Then

$$Q^n(E) = Q^n(E \cap \mathscr{P}_{n}) \leq (n+1)^{|\mathscr{H}|} \, 2^{-nD(P^* \mid\mid Q)}$$

where

$$P^* = \arg\min_{P \in E} D(P \mid\mid Q)$$

is the distribution in $E$ that is closest to $Q$ in relative entropy. If, in addition, the set $E$ is the closure of its interior, then

$$\frac{1}{n} \log Q^n(E) \to -D(P^* \mid\mid Q)$$

Example: suppose we wish to find $\Pr\left\{ \frac{1}{n} \sum_{i=1}^n g_{j}(X_{i}) \geq \alpha_{j}, \; j = 1, 2, \dots, k \right\}$.
To find the distribution in $E$ closest to $Q$, we use Lagrange multipliers and construct the functional

$$J(P) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} + \sum_{j} \lambda_{j} \sum_{x} P(x) g_{j}(x) + \nu \sum_{x} P(x)$$

Differentiating and solving, the closest distribution to $Q$ has the form

$$P^*(x) \propto Q(x) \, e^{\sum_{j} \lambda_{j} g_{j}(x)}$$

If $Q$ is uniform, then $P^*$ is the maximum-entropy distribution.
Example: if a die is rolled $n$ times, what approximately is the probability that the average of the rolls is at least 4?

$$Q^n(E) \doteq 2^{-nD(P^* \mid\mid Q)}$$

where $P^*$ is the distribution minimizing $D(P \mid\mid Q)$ subject to $\sum_{i=1}^6 i P_{i} \geq 4$. From the discussion above, $P^*$ has the form

$$P^*(x) = \frac{2^{\lambda x}}{\sum_{i=1}^6 2^{\lambda i}}$$

where $\lambda$ is chosen so that the constraint holds with equality. Solving numerically gives

$$\begin{aligned} P^* & = (0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468) \\ D(P^* \mid\mid Q) & = 0.0624 \text{ bits} \end{aligned}$$

Thus if a die is rolled $n$ times, the probability that the average is at least 4 is approximately $2^{-0.0624n}$.
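
The numerical solution quoted above can be reproduced with a simple bisection on $\lambda$ (my own sketch, not code from the text):

```python
from math import log2

def tilted(lam):
    """P*(x) proportional to 2^(lam * x) on {1, ..., 6}."""
    w = [2 ** (lam * x) for x in range(1, 7)]
    Z = sum(w)
    return [wi / Z for wi in w]

def mean(P):
    return sum((i + 1) * p for i, p in enumerate(P))

# Bisection on lambda so that the mean of P* is exactly 4
# (the mean is 3.5 at lambda = 0 and increases with lambda).
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if mean(tilted(mid)) < 4 else (lo, mid)

P_star = tilted((lo + hi) / 2)
D = sum(p * log2(6 * p) for p in P_star)   # D(P* || uniform)
print([round(p, 4) for p in P_star])  # ≈ [0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468]
print(round(D, 4))                    # ≈ 0.0624
```
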

Example 2: if a fair coin is tossed 1000 times, the probability of at least 700 heads: $P^* = (0.7, 0.3)$,

$$D(P^* \mid\mid Q) = 1 - H(P^*) = 0.119$$

$$\Pr(\bar{X}_{n} \geq 0.7) \doteq 2^{-nD(P^* \mid\mid Q)}$$
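
A quick check of the exponent (my own sketch):

```python
from math import log2

def D(P, Q):
    """Relative entropy D(P || Q) in bits."""
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

P_star, Q = (0.7, 0.3), (0.5, 0.5)
exponent = D(P_star, Q)       # equals 1 - H(0.7) for a fair coin
print(round(exponent, 3))     # 0.119
```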

Example 3 (mutual dependence): Let $Q(x,y)$ be a given joint distribution and let $Q_{0}(x,y) = Q(x)Q(y)$ be the associated product distribution formed from the marginals of $Q$. We wish to know the likelihood that a sample drawn according to $Q_{0}$ will appear to be jointly distributed according to $Q$. Accordingly, let $(X_{i}, Y_{i})$ be i.i.d. $\sim Q_{0}(x,y)$. Define joint typicality: $(x^n, y^n)$ is jointly typical with respect to a joint distribution $Q(x,y)$ iff the sample entropies are close to their true values,

$$\begin{aligned} \left| -\frac{1}{n} \log Q(x^n) - H(X) \right| & \leq \epsilon \\ \left| -\frac{1}{n} \log Q(y^n) - H(Y) \right| & \leq \epsilon \end{aligned}$$

and

$$\left| -\frac{1}{n} \log Q(x^n, y^n) - H(X,Y) \right| \leq \epsilon$$

We wish to compute the probability of observing a pair $(x^n, y^n)$ that is jointly typical with respect to $Q$. In the language of types, $(x^n, y^n)$ is jointly typical with respect to $Q(x,y)$ if $P_{x^n, y^n} \in E \cap \mathscr{P}_{n}(X,Y)$, where

$$\begin{aligned} E = \Bigg\{ P(x, y) : & \left| -\sum_{x, y} P(x, y) \log Q(x) - H(X) \right| \leq \epsilon, \\ & \left| -\sum_{x, y} P(x, y) \log Q(y) - H(Y) \right| \leq \epsilon, \\ & \left| -\sum_{x, y} P(x, y) \log Q(x, y) - H(X, Y) \right| \leq \epsilon \Bigg\} \end{aligned}$$

By Sanov's theorem, the probability is

$$Q_{0}^n(E) \doteq 2^{-nD(P^* \mid\mid Q_{0})}$$

where $P^*$ is the distribution satisfying the constraints that is closest to $Q_{0}$ in relative entropy; in this problem it is the joint distribution $Q$ itself. Since $Q_{0}$ is the product distribution, the probability is $2^{-nD(Q(x,y) \mid\mid Q(x)Q(y))} = 2^{-nI(X;Y)}$, the same result as obtained from the AEP.
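
For a concrete joint pmf (a made-up example, not from the text), the exponent $I(X;Y) = D(Q(x,y) \mid\mid Q(x)Q(y))$ can be computed directly:

```python
from math import log2

# A made-up joint pmf Q(x, y) on {0, 1}^2 (rows: x, columns: y).
Q = [[0.4, 0.1],
     [0.1, 0.4]]
Qx = [sum(row) for row in Q]                             # marginal of X
Qy = [sum(Q[x][y] for x in range(2)) for y in range(2)]  # marginal of Y

# I(X;Y) = D(Q(x,y) || Q(x)Q(y)): the exponent governing the probability
# that a sample from the product Q0 = Qx * Qy looks jointly typical for Q.
I = sum(Q[x][y] * log2(Q[x][y] / (Qx[x] * Qy[y]))
        for x in range(2) for y in range(2))
print(round(I, 4))  # 0.2781
```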

12.6 The Conditional Limit Theorem

We have shown that the probability that the type of a sample falls in a set of types is, under a distribution $Q$, determined by the element of the set closest to $Q$: to first order in the exponent, the probability is $2^{-nD^*}$, where

$$D^* = \min_{P \in E} D(P \mid\mid Q)$$

This is because the probability of the set is the sum of the probabilities of its elements, which is bounded by the number of terms times the largest term. Since the number of types is polynomial in the sequence length, the exponent of the sum equals that of the largest term.

We now show the following:
Theorem 12.6.1: For a closed convex set $E \subset \mathscr{P}$ and a distribution $Q \notin E$, let $P^* \in E$ be the distribution that achieves the minimum distance to $Q$. Then

$$D(P \mid\mid Q) \geq D(P \mid\mid P^*) + D(P^* \mid\mid Q)$$

for all $P \in E$.

Proof: Let $P \in E$ and $P_{\lambda} = \lambda P + (1-\lambda) P^*$. Consider

$$D_{\lambda} = D(P_{\lambda} \mid\mid Q) = \sum_{x} P_{\lambda}(x) \log \frac{P_{\lambda}(x)}{Q(x)}$$

Since $D_{\lambda}$ is minimized over $\lambda \in [0,1]$ at $\lambda = 0$ (where $P_{\lambda} = P^*$), its derivative there is nonnegative, so

$$\begin{aligned} 0 & \leq \left( \frac{dD_{\lambda}}{d\lambda} \right)_{\lambda=0} \\ & = \sum_{x} (P(x) - P^*(x)) \log \frac{P^*(x)}{Q(x)} \\ & = \sum_{x} P(x) \log \frac{P^*(x)}{Q(x)} - \sum_{x} P^*(x) \log \frac{P^*(x)}{Q(x)} \\ & = \sum_{x} P(x) \log \left( \frac{P(x)}{Q(x)} \cdot \frac{P^*(x)}{P(x)} \right) - \sum_{x} P^*(x) \log \frac{P^*(x)}{Q(x)} \\ & = D(P \mid\mid Q) - D(P \mid\mid P^*) - D(P^* \mid\mid Q) \qquad \square \end{aligned}$$

Note that the relative entropy $D(P \mid\mid Q)$ behaves like the square of the Euclidean distance.
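
The Pythagorean inequality can be sanity-checked numerically; the setup below (fair-coin $Q$, $E = \{P : P(1) \geq 0.7\}$, with I-projection $P^* = (0.3, 0.7)$) is my own toy example, not from the text:

```python
from math import log2

def D(P, Q):
    """Relative entropy D(P || Q) in bits."""
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

Q = (0.5, 0.5)                # fair coin
P_star = (0.3, 0.7)           # closest element of E = {P : P(1) >= 0.7} to Q

# D(P||Q) >= D(P||P*) + D(P*||Q) for every member P of the convex set E.
for P in [(0.3, 0.7), (0.25, 0.75), (0.2, 0.8), (0.1, 0.9)]:
    assert D(P, Q) >= D(P, P_star) + D(P_star, Q) - 1e-12
```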

Theorem 12.6.2 (Conditional limit theorem): Let $E$ be a closed convex subset of $\mathscr{P}$ and let $Q$ be a distribution not in $E$. Let $X_{1}, X_{2}, \dots, X_{n}$ be discrete random variables drawn i.i.d. $\sim Q$, and let $P^*$ achieve $\min_{P \in E} D(P \mid\mid Q)$. Then

$$\Pr(X_{1} = a \mid P_{X^n} \in E) \to P^*(a)$$

in probability as $n \to \infty$; i.e., the conditional distribution of $X_{1}$, given that the type of the sequence is in $E$, is close to $P^*$ for large $n$.
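
The theorem can be illustrated with a small Monte Carlo experiment (my own sketch, using a toy setup I chose: fair-coin $Q$ and $E = \{P : P(1) \geq 0.7\}$, whose closest element to $Q$ is $P^* = (0.3, 0.7)$). At a finite $n$ the conditional frequency overshoots $0.7$ slightly; it approaches $P^*(1) = 0.7$ as $n$ grows:

```python
import random

random.seed(0)

n, trials = 20, 100_000
hits = ones_first = 0
for _ in range(trials):
    x = [random.randint(0, 1) for _ in range(n)]
    if sum(x) >= 14:          # type of x lies in E (at least 70% ones)
        hits += 1
        ones_first += x[0]    # record X1 for the conditioned sample

cond = ones_first / hits
print(round(cond, 2))  # well above Q(1) = 0.5, near P*(1) = 0.7
```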
